Synopsis Generator Notebook
Table of Contents
Introduction & Libraries
Introduction
What is this?
- This is a prototype of a creative app that uses generative models as a playground for users to generate overviews of movies & TV shows. It's mostly an app for learning how to use and fine-tune Hugging Face models like BERT & GPT-2.
What is this notebook? Wouldn't a Python script be easier?
- This is the notebook used to prepare the Python backend for the app. A notebook makes it easier to experiment, and it also suits the exploratory data analysis we'll do first to get to know the data we're working with. Once experimentation is done, we can extract the code into a library that will be pushed to a git repository and used in the final demo & app.
Why this app/idea?
- Because it lets me learn more about how Hugging Face is used to produce models for production. Hugging Face has become the face of NLP models, and taking a pre-existing model and fine-tuning it to fit a need is a skill I wanted to train.
Libraries, Config & APIs
This is where all the imports & configuration happen.
The main imports are:
| Library | What it's used for |
|---|---|
| Pandas & NumPy | Processing & handling the data |
| Pyplot, Plotly & Seaborn | Visualizing the data |
| TensorFlow & Keras | Deep learning backend & framework |
| Transformers & Diffusers | Deep learning NLP & diffusion models |
We also need to configure the API key for TMDB to be able to scrape the database, and an HF token to be able to push fine-tuned models to the Hugging Face Hub.
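If the two secrets aren't already exported in the shell before launching Jupyter, one option is to load them from a local .env file with python-dotenv. This is purely illustrative; the notebook itself assumes the variables are already set in the environment.
from dotenv import load_dotenv
load_dotenv()  ## Optional: loads TMDB_API_KEY & HF_TOKEN from a local .env file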
import os
import glob
import json
import requests
import calendar
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import tensorflow as tf
from transformers import AutoTokenizer, create_optimizer, AdamWeightDecay, DataCollatorWithPadding, TFAutoModelForSequenceClassification, TFAutoModelForCausalLM
from transformers.keras_callbacks import PushToHubCallback, KerasMetricCallback
from keras.callbacks import TensorBoard
import evaluate
from datasets import Dataset
from diffusers import DiffusionPipeline, LCMScheduler
HF_TOKEN = os.environ['HF_TOKEN']
API_KEY = os.environ["TMDB_API_KEY"]
API_VERSION = 3
API_BASE_URL = f"https://api.themoviedb.org/{API_VERSION}"
RANDOM_STATE = 21
2023-12-11 22:20:30.641197: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-11 22:20:30.877217: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-11 22:20:30.877256: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-11 22:20:30.878391: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-11 22:20:31.019250: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/home/alel/Projets/Python/SOD/Projet_Final/Final_Project/.venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
import plotly
plotly.offline.init_notebook_mode()
Scraping TMDB
Using TMDB's API, we can scrape the database with a function that cycles through every movie & TV show ID to fetch its details. Unfortunately there's no way around this short of choosing another database: TMDB doesn't offer full dumps of its data, only daily ID exports plus an API limited to roughly 40 requests per second, which makes the scraping a very heavy process. You can get the script here
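The actual implementation lives in scripts/scrapper.py; as a rough sketch, and assuming the daily ID export is a JSON-lines file, the core loop might look like the snippet below (fetch_details, scrape_ids and the sleep-based throttle are illustrative names, not the real scrapper_tmdb code).
import json
import time
import requests

def fetch_details(media_type, media_id, base_url, api_key):
    ## Illustrative helper : fetch the details of a single movie or TV show
    return requests.get(f"{base_url}/{media_type}/{media_id}", params={"api_key": api_key})

def scrape_ids(id_file, media_type, base_url, api_key, delay=1/40):
    ## Cycle through every ID from the daily export, staying under ~40 requests/s
    results = {}
    with open(id_file) as f:
        ids = [json.loads(line)["id"] for line in f]
    for media_id in ids:
        response = fetch_details(media_type, media_id, base_url, api_key)
        if response.status_code == 200:
            results[media_id] = response.json()
        time.sleep(delay)
    return results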
from scripts.scrapper import scrapper_tmdb
## Movies
scrapper_tmdb('./data/jsons/04_11_2023/ids/movie_ids_04_11_2023.json', './data/jsons/04_11_2023/movie', 10, 'movie', API_BASE_URL, API_KEY)
## TV Shows
scrapper_tmdb('./data/jsons/04_11_2023/ids/tv_series_ids_04_11_2023.json', './data/jsons/04_11_2023/tv', 10, 'tv', API_BASE_URL, API_KEY)
Chapter 1 : Data Analysis & Visualization
Section 1.1 : Building the dataset
After using the scraper, we're left with multiple JSON files that we can load with pandas using the functions defined below
def get_json(file):
with open(file) as f:
data = json.load(f)
return data
def get_df_from_json_folder(filepath, cols=None):
    json_pattern = os.path.join(filepath, '*.json')
    list_of_files = glob.glob(json_pattern)
    ## Sort the batch files by the numeric suffix of their filename
    list_of_files.sort(key=lambda x: int(
        os.path.splitext(x)[0].split('-')[1]
    ))
    dfs = [pd.DataFrame(get_json(file)).T for file in list_of_files]
    result = pd.concat(dfs, ignore_index=True)
    return result[cols] if cols else result
series_columns = ['backdrop_path', 'created_by', 'episode_run_time', 'first_air_date', 'genres', 'in_production', 'languages', 'last_air_date', 'last_episode_to_air', 'name', 'networks', 'number_of_episodes', 'number_of_seasons', 'origin_country', 'original_language', 'original_name', 'overview', 'popularity', 'production_companies', 'production_countries', 'seasons', 'spoken_languages', 'status', 'tagline', 'type', 'vote_average', 'vote_count']
movies_columns = ['backdrop_path', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'original_language', 'original_title', 'overview', 'popularity', 'poster_path', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']
df_movies_raw = get_df_from_json_folder('./data/jsons/04_11_2023/movie', movies_columns)
df_series_raw = get_df_from_json_folder('./data/jsons/04_11_2023/tv', series_columns)
Section 1.2 : EDA
To preprocess the data for EDA, we start by filtering out columns that aren't useful for the analysis and revealing where the NaN values are.
def preprocess_df(df_raw, col_list, col_process_list):
    result_df = df_raw[col_list].copy()
    ## Turn empty strings & empty lists into NaN so missing data becomes visible
    result_df = result_df.mask(result_df == '').mask(result_df.map(str).eq('[]'))
    ## Flatten lists of {'id': ..., 'name': ...} dicts into plain lists of names
    for col in col_process_list:
        result_df[col] = result_df[col].apply(lambda row : [dicts['name'] for dicts in row] if type(row) == list else row)
    return result_df
movies_eda_columns = ['belongs_to_collection', 'budget', 'genres', 'homepage', 'original_language', 'original_title', 'overview', 'popularity', 'production_companies', 'production_countries', 'release_date', 'revenue', 'runtime', 'spoken_languages', 'status', 'tagline', 'title', 'vote_average', 'vote_count']
movies_cols_to_process = ['genres', 'production_companies', 'production_countries']
df_movies = preprocess_df(df_movies_raw, movies_eda_columns, movies_cols_to_process)
df_movies['belongs_to_collection'] = df_movies['belongs_to_collection'].fillna(value=np.nan).apply(lambda row: row['name'] if type(row) == dict else row)
series_eda_columns = ['created_by', 'episode_run_time', 'first_air_date', 'genres', 'in_production', 'languages', 'last_air_date', 'name', 'networks', 'number_of_episodes', 'number_of_seasons', 'origin_country', 'original_language', 'original_name', 'overview', 'popularity', 'production_companies', 'production_countries', 'seasons', 'spoken_languages', 'status', 'tagline', 'type', 'vote_average', 'vote_count']
series_cols_to_process = ['created_by', 'genres', 'networks', 'production_companies', 'production_countries', 'seasons']
df_series = preprocess_df(df_series_raw, series_eda_columns, series_cols_to_process)
for df in [df_movies, df_series]:
df['spoken_languages'] = df['spoken_languages'].apply(lambda row: [dicts['english_name'] for dicts in row] if type(row) == list else row)
We can now look at the data once it's been preprocessed
print(df_movies.shape)
df_movies.head(2)
(859971, 19)
| belongs_to_collection | budget | genres | homepage | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Blondie Collection | 0 | [Comedy, Family] | NaN | en | Blondie | Blondie and Dagwood are about to celebrate the... | 2.5 | [Columbia Pictures] | [United States of America] | 1938-11-30 | 0 | 70 | [English] | Released | The favorite comic strip of millions at last o... | Blondie | 7.2 | 7 |
| 1 | NaN | 0 | [Adventure] | NaN | de | Der Mann ohne Namen | NaN | 1.091 | NaN | [Germany] | 1921-01-01 | 0 | 420 | NaN | Released | NaN | Peter Voss, Thief of Millions | 0.0 | 0 |
print(df_series.shape)
df_series.head(2)
(158626, 25)
| created_by | episode_run_time | first_air_date | genres | in_production | languages | last_air_date | name | networks | number_of_episodes | ... | popularity | production_companies | production_countries | seasons | spoken_languages | status | tagline | type | vote_average | vote_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | [60] | 2004-01-12 | [Drama] | False | [ja] | 2004-03-22 | Pride | [Fuji TV] | 11 | ... | 18.171 | NaN | [Japan] | [Season 1] | [Japanese] | Ended | NaN | Scripted | 8.2 | 13 |
| 1 | [Kevin Smith, Scott Mosier, David Mandel] | [22] | 2000-05-31 | [Animation, Comedy] | False | [en] | 2002-12-22 | Clerks | [ABC, Comedy Central] | 6 | ... | 31.201 | [Touchstone Television, View Askew Productions... | [United States of America] | [Specials, Season 1] | [English] | Canceled | NaN | Scripted | 7.012 | 86 |
2 rows × 25 columns
One important thing to look for is NaN values. The preprocessing step distinguished NaNs from void values (like empty lists or strings) so we can visualize them properly, for example with a bar chart or a heatmap.
nan_plot_dict = {
"Movies" : df_movies.isna().T,
"TV Show" : df_series.isna().T
}
plt.subplots(figsize=(15,5))
plt.axis('off')
plt.suptitle('Missing data in DFs')
index = 1
for k,v in nan_plot_dict.items():
plt.subplot(1,2,index)
plt.title(k)
sns.heatmap(v, cbar=False)
index += 1
plt.tight_layout()
plt.savefig('./viz/nan_heatmap.png')
nan_percentage_dict = {
"Movies" : df_movies.isna().sum().sort_values(ascending=True) / len(df_movies) * 100,
"TV Show" : df_series.isna().sum().sort_values(ascending=True) / len(df_series) * 100
}
plt.subplots(figsize=(15,5))
plt.axis('off')
plt.suptitle('NaN percentage in DFs')
index = 1
for k,v in nan_percentage_dict.items():
plt.subplot(1,2,index)
plt.title(k)
v.plot(kind='barh')
index += 1
plt.tight_layout()
plt.savefig('./viz/nan_percent.png')
Now we can see that some columns have a lot of missing data. For generating overviews & classifying genres, the features we're interested in are overview & genres. A lot of overviews & genres are missing for TV shows, and somewhat fewer overviews are missing for movies. We can therefore see if we can train a model to infer those missing genres, since trying to generate convincing overviews based on title & genres alone seems a bit too difficult.
We can now build some general visualizations of the data, such as the distribution of original languages and genres
plot_dict_movies = {
"Original languages of movies" :
df_movies['original_language'].value_counts().head().sort_values(ascending=True),
"Movies genres" :
df_movies['genres'].value_counts().head().sort_values(ascending=True),
"Original languages of tv shows" :
df_series['original_language'].value_counts().head().sort_values(ascending=True),
"TV Shows genres" :
df_series['genres'].value_counts().head().sort_values(ascending=True)
}
plt.subplots(figsize=(15,10))
plt.axis('off')
index = 1
for key, value in plot_dict_movies.items():
plt.subplot(2,2, index)
plt.title(key)
value.plot(kind='barh')
index += 1
plt.savefig('./viz/languages_genres.png')
df_movies['release_date'] = pd.to_datetime(df_movies['release_date'])
df_series['first_air_date'] = pd.to_datetime(df_series['first_air_date'], errors='coerce')
df_movies['release_year'] = df_movies['release_date'].dt.year
df_movies['release_month'] = df_movies['release_date'].dt.month
df_series['first_air_year'] = df_series['first_air_date'].dt.year
df_series['first_air_month'] = df_series['first_air_date'].dt.month
print('TV Shows Air Date NaNs :', (df_series['first_air_date'].dt.year).isna().sum(), '\nMovies Air Date NaNs :', (df_movies['release_date'].dt.year).isna().sum())
TV Shows Air Date NaNs : 30083 
Movies Air Date NaNs : 79628
plot_dict = {
"TV Show" : df_series['first_air_date'].dt.year.value_counts().head(10),
"Movies" : df_movies['release_date'].dt.year.value_counts().head(10)
}
plt.subplots(figsize=(15,5))
plt.axis('off')
plt.suptitle('Most Release Per Year in DFs')
index = 1
for k,v in plot_dict.items():
plt.subplot(1,2,index)
plt.title(k)
v.plot(kind='bar')
index += 1
plt.savefig('./viz/release_year.png')
start_year, end_year = 2019, 2023
year_list = [i for i in range(start_year, end_year + 1)]
plot_list = [[df_movies, 'release_date'], [df_series, 'first_air_date']]
title_list = ['Movies', 'TV Show']
def get_count_by_month(df, col, year):
result = df[df[col].dt.year == year][col].dt.strftime("%b").value_counts().reindex(calendar.month_abbr[1:])
return result
plt.subplots(figsize=(20,5))
plt.axis('off')
plt.suptitle('Release by month from :')
for index in range(2):
plt.subplot(1, 2, index + 1)
plt.title(title_list[index])
for year in year_list:
get_count_by_month(plot_list[index][0], plot_list[index][1], year).plot(marker='o')
plt.legend(year_list[:])
plt.savefig('./viz/release_month.png')
plots = [df_movies[df_movies['runtime'] != 0]['runtime'],
df_series['episode_run_time']]
index = 1
plt.subplots(figsize=(15,5))
plt.axis('off')
for each in plots:
plt.subplot(1,2,index)
each.value_counts().head().plot(kind='barh')
index += 1
plt.savefig('./viz/runtime.png')
plot_dict = {
"TV Show per countries" : df_series['production_countries'].explode().dropna().value_counts(),
"Movies per countries" : df_movies['production_countries'].explode().dropna().value_counts()
}
for title, plot in plot_dict.items():
fig = go.Figure(data=go.Choropleth(locationmode='country names', locations=plot.index.values, text=plot.index, z=plot.values, colorscale = 'Greys'))
fig.update_layout(height=600, width=800, title_text=title,title_x=0.5)
fig.show()
Chapter 2 : Models
The app has 2 principal functionalities that LLMs are able to help us with (a sketch of both follows this list) :
- Classifying overviews into genres :
- The user writes an overview and the model tells us which genre this overview fits into
- Generating overviews from genres & title :
- The user writes a title & selects genres, and the model gives us an overview that fits the title & genres
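As a rough sketch of how the app could serve both features once the fine-tuned checkpoints are on the Hub, the pipeline wrappers below are one possible approach (assuming the repository names pushed later in this notebook), not the app's actual code.
from transformers import pipeline

## Feature 1 : classify a user-written overview into a genre
classify = pipeline("text-classification", model="Alirani/overview_classifier_final", framework="tf")
print(classify("A retired hitman is pulled back in for one last job."))

## Feature 2 : generate an overview from a "title | genre | " prompt
generate = pipeline("text-generation", model="Alirani/distilgpt2-finetuned-synopsis-genres_final", framework="tf")
print(generate("Blondie | Family | ", max_length=100)[0]["generated_text"])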
Section 2.1 : Classification
We classify genres based on overviews. The model we'll fine-tune is DistilBERT, a lighter, distilled version of BERT that is well suited to text classification.
cols = ['genres', 'overview', 'title']
df_clf = pd.concat([
df_movies[df_movies['overview'].notna()],
df_series[df_series['overview'].notna()].rename(columns={'name' : 'title'})
], ignore_index=True)[cols].rename(columns={'genres' : 'label'})
def filter_dominant_genres(genres_list):
    ## These genres dominate the dataset ; when a title also carries a rarer
    ## genre, keep the rarer one to rebalance the labels
    dominants_genres_list = ['Drama', 'Documentary', 'Comedy', 'Animation', 'Horror']
    if len(genres_list) > 1 and genres_list[0] in dominants_genres_list:
        processed_genres = [genre for genre in genres_list if genre not in dominants_genres_list]
        result = processed_genres[0] if len(processed_genres) > 0 else genres_list[0]
        return result
    else:
        return genres_list[0]
## Merge overlapping TV & movie genre labels into a single canonical label
replace_dict = {
'Sci-Fi & Fantasy' : 'Science Fiction',
'War & Politics' : 'History',
'Musical' : 'Music',
'News' : 'Reality',
'Talk' : 'Reality',
'Soap' : 'Drama',
'Kids' : 'Family',
'Action & Adventure' : 'Adventure',
'War' : 'History'
}
df_clf['label'] = df_clf['label'].apply(lambda row : filter_dominant_genres(row) if type(row) == list else row).replace(replace_dict)
df_clf['text'] = df_clf['title'] + ' | ' + df_clf['overview']
candidates = df_clf['label'].value_counts().index.to_list()
df_clf_to_train = df_clf[~df_clf['label'].isna()].copy()
df_clf_to_train.shape
(571212, 4)
## Factorize genre labels into integer ids, keeping the index to map back later
df_clf_to_train['label'], genre_index = df_clf_to_train['label'].factorize()
df_clf_to_train.head()
| label | overview | title | text | |
|---|---|---|---|---|
| 0 | 0 | Blondie and Dagwood are about to celebrate the... | Blondie | Blondie | Blondie and Dagwood are about to cel... |
| 1 | 1 | Love at Twenty unites five directors from five... | Love at Twenty | Love at Twenty | Love at Twenty unites five di... |
| 3 | 0 | Elmo is making a very, very super special surp... | Sesame Street: Elmo Loves You! | Sesame Street: Elmo Loves You! | Elmo is makin... |
| 4 | 1 | After the coal mine he works at closes and his... | Ariel | Ariel | After the coal mine he works at closes... |
| 5 | 1 | Nikander, a rubbish collector and would-be ent... | Shadows in Paradise | Shadows in Paradise | Nikander, a rubbish coll... |
tokenizer_clf = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def tokenize_classifier(examples):
return tokenizer_clf(examples["text"], truncation=True)
cols = ['text', 'label']
raw_datasets = Dataset.from_pandas(df_clf_to_train[cols], preserve_index=True).train_test_split(seed=RANDOM_STATE)
tokenized_dataset = raw_datasets.map(tokenize_classifier, batched=True, remove_columns=['__index_level_0__'])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer_clf, return_tensors="tf")
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=1)
return accuracy.compute(predictions=predictions, references=labels)
n_label = len(genre_index)
id2label = {id:genre_index[id] for id in range(n_label)}
label2id = {genre_index[id]:id for id in range(n_label)}
batch_size = 8  ## must match the batch_size passed to prepare_tf_dataset below
num_epochs = 3
batches_per_epoch = len(tokenized_dataset["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
model_clf = TFAutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased", num_labels=n_label, id2label=id2label, label2id=label2id
)
tf_train_set = model_clf.prepare_tf_dataset(
    tokenized_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)
tf_validation_set = model_clf.prepare_tf_dataset(
    tokenized_dataset["test"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)
model_clf.compile(optimizer=optimizer)
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
push_to_hub_callback = PushToHubCallback(
output_dir="overview_classifier_final",
tokenizer=tokenizer_clf,
)
tensorboard_callback = TensorBoard(log_dir="./overview_classifier_final/logs")
callbacks = [metric_callback, tensorboard_callback, push_to_hub_callback]
model_clf.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=callbacks)
2023-11-25 00:08:41.652008: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:880] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node Your kernel may have been built without NUMA support.
2023-11-25 00:08:41.953486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1977] Could not identify NUMA node of platform GPU id 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2023-11-25 00:08:41.953964: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5606 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3060 Ti, pci bus id: 0000:01:00.0, compute capability: 8.6
2023-11-25 00:08:46.371265: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: Permission denied
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/home/alel/Projets/Python/SOD/Projet_Final/overview_classifier_final is already a clone of https://huggingface.co/Alirani/overview_classifier_final. Make sure you pull the latest changes with `repo.git_pull()`.
Epoch 1/3
2023-11-25 00:08:58.792376: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1fea6190 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-11-25 00:08:58.792414: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3060 Ti, Compute Capability 8.6
2023-11-25 00:08:58.799206: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-11-25 00:08:58.850631: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8902
2023-11-25 00:08:58.910458: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
53551/53551 [==============================] - 6487s 121ms/step - loss: 1.4370 - val_loss: 1.3354 - accuracy: 0.5644
Epoch 2/3
53551/53551 [==============================] - 5988s 112ms/step - loss: 1.1724 - val_loss: 1.3251 - accuracy: 0.5701
Epoch 3/3
53551/53551 [==============================] - 5753s 107ms/step - loss: 1.1282 - val_loss: 1.3251 - accuracy: 0.5701
<keras.src.callbacks.History at 0x7f32a04d4df0>
Section 2.2 : Generation
Section 2.2.1 : Synopsis Generation
To generate synopses, we will fine-tune a GPT-2 model on the title, genre and overview of each movie or TV show and check how it behaves. We will use the classifier we just fine-tuned to predict the missing genres.
First we get a dataframe containing only the rows with missing genres
df_clf_to_pred = df_clf[df_clf['label'].isna()].dropna(subset=['title']).copy()
df_clf_to_pred.shape
(230912, 4)
We keep using the classifier we just trained (and pushed to the Hugging Face Hub), and we define a function that'll infer the genre of a title + overview input
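If the notebook were restarted, the fine-tuned classifier could instead be reloaded from the Hub, which guarantees we always use the latest pushed version (the repository name below is the one from the push-to-hub callback above).
tokenizer_clf = AutoTokenizer.from_pretrained("Alirani/overview_classifier_final")
model_clf = TFAutoModelForSequenceClassification.from_pretrained("Alirani/overview_classifier_final")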
def get_pred_genre(text, tokenizer, model):
    ## Tokenize the "title | overview" string and return the highest-scoring genre
    tokenized_input = tokenizer(text, max_length=512, truncation=True, padding='max_length', return_tensors="tf")
    logits = model(**tokenized_input).logits
    predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
    return model.config.id2label[predicted_class_id]
df_clf_to_pred['label'] = df_clf_to_pred.apply(lambda row: get_pred_genre(row['text'], tokenizer_clf, model_clf), axis=1)
df_clf_to_pred['text'] = df_clf_to_pred['title'] + ' | ' + df_clf_to_pred['label'] + ' | ' + df_clf_to_pred['overview']
df_clf_to_train['label'] = df_clf_to_train['label'].replace(id2label)
df_clf_to_train['text'] = df_clf_to_train['title'] + ' | ' + df_clf_to_train['label'] + ' | ' + df_clf_to_train['overview']
df_gen = pd.concat([df_clf_to_train, df_clf_to_pred])
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_gen = TFAutoModelForCausalLM.from_pretrained(model_name)
All PyTorch model weights were used when initializing TFGPT2LMHeadModel.
All the weights of TFGPT2LMHeadModel were initialized from the PyTorch model. If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.
def generate_synopsis(model, tokenizer, title):
    ## Encode the "title | genre | " prompt and decode a beam-searched continuation
    input_ids = tokenizer(title, return_tensors="tf")
    output = model.generate(input_ids['input_ids'], max_length=150, num_beams=5, no_repeat_ngram_size=2, top_k=50, attention_mask=input_ids['attention_mask'])
    synopsis = tokenizer.decode(output[0], skip_special_tokens=True)
    return synopsis
prompt = "Blondie | Family | "
print(f"Model output before fine-tuning: {generate_synopsis(model_gen, tokenizer, prompt)}\nWhat we're expecting : {df_gen.iloc[0]['text']}")
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Model output before fine-tuning: Blondie | Family | __________________
What we're expecting : Blondie | Family | Blondie and Dagwood are about to celebrate their fifth wedding anniversary but this happy occasion is marred when the bumbling Dagwood gets himself involved in a scheme that is promising financial ruin for the Bumstead family.
We can see the model's output isn't what we're expecting, but now we can fine-tune it and see if that improves the output
def tokenize_generator(examples):
return tokenizer(examples["text"])
raw_datasets = Dataset.from_pandas(df_gen, preserve_index=True).train_test_split(seed=RANDOM_STATE)
tokenized_datasets = raw_datasets.map(
tokenize_generator, batched=True, remove_columns=['title', 'overview', 'text', 'label', '__index_level_0__']
)
def group_texts(examples, block_size = 128):
    ## Concatenate all tokenized texts into one long sequence per key
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    ## Drop the remainder so every chunk is exactly block_size tokens long
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    ## For causal LM training, labels are a copy of the inputs (the model shifts them internally)
    result["labels"] = result["input_ids"].copy()
    return result
Token indices sequence length is longer than the specified maximum sequence length for this model (1076 > 1024). Running this sequence through the model will result in indexing errors
lm_datasets = tokenized_datasets.map(
group_texts,
batched=True,
batch_size=1000,
)
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model_gen.compile(optimizer=optimizer)
train_set = model_gen.prepare_tf_dataset(
lm_datasets["train"],
shuffle=True,
batch_size=8,
)
validation_set = model_gen.prepare_tf_dataset(
lm_datasets["test"],
shuffle=False,
batch_size=8,
)
tensorboard_callback = TensorBoard(log_dir="./distilgpt2-finetuned-synopsis-genres_final/logs")
push_to_hub_callback = PushToHubCallback(
output_dir="./distilgpt2-finetuned-synopsis-genres_final",
tokenizer=tokenizer
)
callbacks = [tensorboard_callback, push_to_hub_callback]
model_gen.fit(train_set, validation_data=validation_set, epochs=4, callbacks=callbacks)
model_gen.save_pretrained("./data/model/distilgpt2-finetuned-synopsis-genres_final")
/home/alel/.local/lib/python3.10/site-packages/keras/src/optimizers/legacy/adam.py:118: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead. super().__init__(name, **kwargs)
Cloning https://huggingface.co/Alirani/distilgpt2-finetuned-synopsis-genres_final into local empty directory.
WARNING:huggingface_hub.repository:Cloning https://huggingface.co/Alirani/distilgpt2-finetuned-synopsis-genres_final into local empty directory.
Epoch 1/4
6/40901 [..............................] - ETA: 1:20:22 - loss: 4.5849
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0315s vs `on_train_batch_end` time: 0.0793s). Check your callbacks.
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0315s vs `on_train_batch_end` time: 0.0793s). Check your callbacks.
40901/40901 [==============================] - 5120s 125ms/step - loss: 4.0094 - val_loss: 3.8307
Epoch 2/4
40901/40901 [==============================] - 5655s 138ms/step - loss: 3.8810 - val_loss: 3.7803
Epoch 3/4
40901/40901 [==============================] - 6510s 159ms/step - loss: 3.8211 - val_loss: 3.7502
Epoch 4/4
40901/40901 [==============================] - 6037s 148ms/step - loss: 3.7786 - val_loss: 3.7310
Section 2.2.2 : Poster Generation
To generate posters, we use Stable Diffusion XL with the LCM-LoRA weights, which let the pipeline produce an image in as few as 4 inference steps (with guidance disabled, as is typical with LCM).
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
results = pipe(
prompt="The poster of a movie called Blondie",
num_inference_steps=4,
guidance_scale=0.0,
)
results.images[0]
Loading pipeline components...: 100%|██████████| 7/7 [00:01<00:00, 5.45it/s]
The config attributes {'skip_prk_steps': True} were passed to LCMScheduler, but are not expected and will be ignored. Please verify your scheduler_config.json configuration file.
100%|██████████| 4/4 [14:32<00:00, 218.20s/it]
Chapter 3 : App
Finally, we test the helper functions the app uses to query TMDB directly.
Section 3.1 : Search
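These helpers live in scripts/queries.py; a plausible minimal version of search_query, hitting TMDB's /search endpoint, might look like this (an illustrative sketch, not the actual module).
import requests

def search_query(media_type, query, base_url, api_key):
    ## Illustrative : search TMDB for movies or TV shows matching `query`
    return requests.get(f"{base_url}/search/{media_type}", params={"api_key": api_key, "query": query})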
from scripts.queries import search_query
results = search_query('movie', 'Batman', API_BASE_URL, API_KEY)
print("réponse : ", results.status_code, "\n output : ")
pd.DataFrame().from_dict(json.loads(results.text)['results']).head()
response : 200 
output :
| adult | backdrop_path | genre_ids | id | original_language | original_title | overview | popularity | poster_path | release_date | title | video | vote_average | vote_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | /frDS8A5vIP927KYAxTVVKRIbqZw.jpg | [14, 28, 80] | 268 | en | Batman | Batman must face his most ruthless nemesis whe... | 41.401 | /cij4dd21v2Rk2YtUQbV5kW69WB2.jpg | 1989-06-21 | Batman | False | 7.220 | 7243 |
| 1 | False | /bxxupqG6TBLKC60M6L8iOvbQEr6.jpg | [28, 35, 80] | 2661 | en | Batman | The Dynamic Duo faces four super-villains who ... | 20.018 | /zzoPxWHnPa0eyfkMLgwbNvdEcVF.jpg | 1966-07-30 | Batman | False | 6.301 | 791 |
| 2 | False | /xEG5iP1qZCiDt4BefSpLy1d54zE.jpg | [28, 12, 80, 878, 53, 10752] | 125249 | en | Batman | Japanese master spy Daka operates a covert esp... | 10.790 | /AvzD3mrtokIzZOiV6zAG7geIo6F.jpg | 1943-07-16 | Batman | False | 6.400 | 59 |
| 3 | False | /p2aiSLQZx7AVZrY9cfOOPv1u5Zk.jpg | [27, 53, 878] | 1160196 | en | Batman | A young man learns the consequence of tempting... | 3.991 | /qIvTMHX2MIYG2Ij4jP5dkKgMqUo.jpg | 2023-07-28 | Batman | False | 6.000 | 3 |
| 4 | False | /tRS6jvPM9qPrrnx2KRp3ew96Yot.jpg | [80, 9648, 53] | 414906 | en | The Batman | In his second year of fighting crime, Batman u... | 136.362 | /74xTEgt7R36Fpooo50r9T25onhq.jpg | 2022-03-01 | The Batman | False | 7.701 | 8833 |
Section 3.2 : Details
from scripts.queries import get_details
result = get_details('movie', 299054, API_BASE_URL, API_KEY)
print("réponse : ", result.status_code, "\n output : ")
pd.DataFrame().from_dict(json.loads(result.text), orient='index').T
response : 200 
output :
| adult | backdrop_path | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | /j9LX1sF7WSXmJlnhf0RGpWzEC0i.jpg | {'id': 126125, 'name': 'The Expendables Collec... | 100000000 | [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam... | https://expendables.movie/ | 299054 | tt3291150 | en | Expend4bles | ... | 2023-09-15 | 58000000 | 103 | [{'english_name': 'English', 'iso_639_1': 'en'... | Released | They'll die when they're dead. | Expend4bles | False | 6.428 | 824 |
1 rows × 25 columns
Section 3.3 : Trending
from scripts.queries import get_trendings
result = get_trendings(1, 'day', API_BASE_URL, API_KEY)
print("réponse : ", result.status_code, "\n output : ", pd.DataFrame().from_dict(json.loads(result.text)['results']).shape)
pd.DataFrame().from_dict(json.loads(result.text)['results']).head()
response : 200 
output : (20, 15)
| adult | backdrop_path | id | name | original_language | original_name | overview | poster_path | media_type | genre_ids | popularity | first_air_date | vote_average | vote_count | origin_country | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | /wyLHV7oP0O88aVFFkS2Ue71Of6f.jpg | 96648 | Sweet Home | ko | 스위트홈 | As humans turn into savage monsters and wreak ... | /u8sLAJUvY9yzWqtVfKRQz5yin3D.jpg | tv | [18, 10765] | 315.221 | 2020-12-18 | 8.400 | 1089 | [KR] |
| 1 | False | /jEDILaZtJOqNTEnFqWnYsCEVHpr.jpg | 94244 | Obliterated | en | Obliterated | A special forces team thwarts a deadly plot in... | /5g3UrcV6oguAcI3myMKb6wi28y5.jpg | tv | [35, 10759] | 121.012 | 2023-11-30 | 7.917 | 12 | [US] |
| 2 | False | /vcFW09U4834DyFOeRZpsx9x1D3S.jpg | 57243 | Doctor Who | en | Doctor Who | The Doctor is a Time Lord: a 900 year old alie... | /4edFyasCrkH4MKs6H4mHqlrxA6b.jpg | tv | [10759, 18, 10765] | 533.627 | 2005-03-26 | 7.460 | 2703 | [GB] |
| 3 | False | /oT81JufYbkP9BkFZm32VwvXRBOc.jpg | 239770 | Doctor Who | en | Doctor Who | The Doctor and friends travel from the dawn of... | /2I8aMfUvgRKQvEpBIQVKMbXgMsi.jpg | tv | [10759, 18, 10765] | 111.138 | 7.620 | 25 | [GB] | |
| 4 | False | /2bzS31ujJhUlKzXrU5nQ2OiV1G9.jpg | 202411 | Monarch: Legacy of Monsters | en | Monarch: Legacy of Monsters | After surviving Godzilla's attack on San Franc... | /uwrQHMnXD2DA1rvaMZk4pavZ3CY.jpg | tv | [18, 10765, 10759] | 1436.206 | 2023-11-16 | 8.267 | 195 | [US] |
Section 3.4 : Top-rated
from scripts.queries import get_top_rated
results = get_top_rated("tv", 1, API_BASE_URL, API_KEY)
print("réponse : ", results.status_code, "\n output : ")
pd.DataFrame().from_dict(json.loads(results.text)['results']).head()
response : 200 
output :
| adult | backdrop_path | genre_ids | id | origin_country | original_language | original_name | overview | popularity | poster_path | first_air_date | name | vote_average | vote_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | /9faGSFi5jam6pDWGNd0p8JcJgXQ.jpg | [18, 80] | 1396 | [US] | en | Breaking Bad | When Walter White, a New Mexico chemistry teac... | 354.925 | /3xnWaLQjelJDDF7LT1WBo6f4BRe.jpg | 2008-01-20 | Breaking Bad | 8.900 | 12708 |
| 1 | False | /rkB4LyZHo1NHXFEDHl9vSD9r1lI.jpg | [16, 18, 10765, 10759] | 94605 | [US] | en | Arcane | Amid the stark discord of twin cities Piltover... | 91.310 | /fqldf2t8ztc9aiwn3k6mlX3tvRT.jpg | 2021-11-06 | Arcane | 8.743 | 3450 |
| 2 | False | /a6ptrTUH1c5OdWanjyYtAkOuYD0.jpg | [10759, 35, 16] | 37854 | [JP] | ja | ワンピース | Years ago, the fearsome Pirate King, Gol D. Ro... | 100.373 | /e3NBGiAifW9Xt8xD5tpARskjccO.jpg | 1999-10-20 | One Piece | 8.725 | 4187 |
| 3 | False | /rBF8wVQN8hTWHspVZBlI3h7HZJ.jpg | [16, 35, 10765, 10759] | 60625 | [US] | en | Rick and Morty | Rick is a mentally-unbalanced but scientifical... | 924.938 | /gdIrmf2DdY5mgN6ycVP0XlzKzbE.jpg | 2013-12-02 | Rick and Morty | 8.700 | 8839 |
| 4 | False | /A6tMQAo6t6eRFCPhsrShmxZLqFB.jpg | [10759, 16, 10765] | 31911 | [JP] | ja | 鋼の錬金術師 FULLMETAL ALCHEMIST | Disregard for alchemy’s laws ripped half of Ed... | 157.214 | /5ZFUEOULaVml7pQuXxhpR2SmVUw.jpg | 2009-04-05 | Fullmetal Alchemist: Brotherhood | 8.694 | 1804 |